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Abstract 

A SNoW based learning approach to shallow pars- 
ing tasks is presented and studied experimentally. 
The approach learns to identify syntactic patterns by 
combining simple predictors to produce a coherent 
inference. Two instantiations of this approach are 
studied and experimental results for Noun-Phrases 
(NP) and Subject- Verb (SV) phrases that compare 
favorably with the best published results are pre- 
sented. In doing that, we compare two ways of mod- 
eling the problem of learning to recognize patterns 
and suggest that shallow parsing patterns are bet- 
tor learned uoing opon/clooo prodictoro than uoing 



This paper presents a general learning ap- 
proach for identifying syntactic patterns, based 



on the SNoW learning architecture ( Roth, 1998 ; 
Roth, 1999| ) . The SNoW learning architecture is 
a sparse network of linear functions over a pre- 
defined or incrementally learned feature space. 
SNoW is specifically tailored for learning in do- 
mains in which the potential number of infor- 
mation sources (features) taking part in deci- 
sions is very large - of which NLP is a princi- 
pal example. Preliminary versions of it have al- 
ready been used successfully on several tasks in 
natural language processing (|Roth, 1998 ; Gold 



inaidc/outaidc prcdictora. 



ing and Roth, 1999| ; |Roth and Zelenko, 1998|) . 



1 Introduction 

Shallow parsing is studied as an alternative to 
full-sentence parsers. Rather than producing a 
complete analysis of sentences, the alternative is 
to perform only partial analysis of the syntactic 
structures in a text (|Harris, 1957 ; Abney, 1991 



Greffenstette, 1993 ). Shallow parsing informa- 
tion such as NPs and other syntactic sequences 
have been found useful in many large-scale lan- 
guage processing applications including infor- 
mation extraction and text summarization. A 
lot of the work on shallow parsing over the past 
years has concentrated on manual construction 
of rules. The observation that shallow syntactic 
information can be extracted using local infor- 
mation - by examining the pattern itself, its 
nearby context and the local part-of-speech in- 
formation - has motivated the use of learning 



methods to recognize these patterns (Church 



1988t Ramshaw and Marcus^l995 



ah, 1998; Cardie and Pierce, 1998) 



Argamon et 
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In particular, SNoW's sparse architecture sup- 
ports well chaining and combining predictors to 
produce a coherent inference. This property of 
the architecture is the base for the learning ap- 
proach studied here in the context of shallow 
parsing. 

Shallow parsing tasks often involve the iden- 
tification of syntactic phrases or of words that 
participate in a syntactic relationship. Com- 
putationally, each decision of this sort involves 
multiple predictions that interact in some way. 
For example, in identifying a phrase, one can 
identify the beginning and end of the phrase 
while also making sure they are coherent. 

Our computational paradigm suggests using a 
SNoW based predictor building block that 
learns to perform each of the required predic- 
tions, and writing a simple program that acti- 
vates these predictors with the appropriate in- 
put, aggregates their output and controls the 
interaction between the predictors. Two instan- 
tiations of this paradigm are studied and eval- 
uated on two different shallow parsing tasks - 
identifying base NPs and SV phrases. The first 
instantiation of this paradigm uses predictors 
to decide whether each word belongs to the in- 



terior of a phrase or not, and then groups the 
words into phrases. The second instantiation 
finds the borders of phrases (beginning and end) 
and then pairs them in an "optimal" way into 
different phrases. These problems formulations 
are similar to those studied in (|Ramshaw and 
Marcus, 1995p and ( Church, 1988l TlAr" eamon eti 
ah, 1998 ), respectively. 

The experimental results presented using the 
SNoW based approach compare favorably with 
previously published results, both for NPs and 
SV phrases. As important, we present a few 
experiments that shed light on some of the is- 
sues involved in using learned predictors that 
interact to produce the desired inference. In 
particular, we exhibit the contribution of chain- 
ing: features that are generated as the output 
of one of the predictors contribute to the per- 
formance of another predictor that uses them as 
its input. Also, the comparison between the two 
instantiations of the learning paradigm - the In- 
side/Outside and the Open/Close - shows the 
advantages of the Open/Close model over the 
Inside/Outside, especially for the task of iden- 
tifying long sequences. 

The contribution of this work is in improving 
the state of the art in learning to perform shal- 
low parsing tasks, developing a better under- 
standing for how to model these tasks as learn- 
ing problems and in further studying the SNoW 
based computational paradigm that, we believe, 
can be used in many other related tasks in NLP. 



The rest of this paper is organized as follows : 



The SNoW aichiUxliue is piesenled in Sec. [||. 
Sec. ^ presents the shallow parsing tasks stud- 
ied and provides details on the computational 
approach. Sec. || describes the data used and 
the experimental approach, and Sec. |B| presents 
and discusses the experimental results. 

2 SNoW 

The SNoW (Sparse Network of WinnowsQ) 
learning architecture is a sparse network of lin- 
ear units over a common pre-defined or incre- 
mentally learned feature space. Nodes in the 
input layer of the network represent simple rela- 
tions over the input sentence and are being used 
as the input features. Each linear unit is called 
a target node and represents relations which are 
of interest over the input sentence; in the cur- 



rent application, target nodes may represent a 
potential prediction with respect to a word in 
the input sentence, e.g., inside a phrase, outside 
a phrase, at the beginning of a phrase, etc. An 
input sentence, along with a designated word 
of interest in it, is mapped into a set of fea- 
tures which are active in it; this representation 
is presented to the input layer of SNoW and 
propagates to the target nodes. Target nodes 
are linked via weighted edges to (some of the) 
input features. Let At = {ii, ■ ■ ■ , i m } be the set 
of features that are active in an example and 
are linked to the target node t. Then the linear 
unit is active iff YlieA w t > @t, where w\ is the 
weight on the edge connecting the ith feature 
to the target node t, and 6% is the threshold for 
the target node t. 

Each SNoW unit may include a collection of 
subnetworks, one for each of the target rela- 
tions. A given example is treated autonomously 
by each target subnetwork; an example labeled 
t may be treated as a positive example by the 
subnetwork for t and as a negative example by 
the rest of the target nodes. 

The learning policy is on-line and mistake- 
driven; several update rules can be used within 
SNoW. The most successful update rule, and 
the only one used in this work is a variant of 
Littlestone's (1988) Winnow update rule, a mul- 
tiplicative update rule tailored to the situation 
in which the set of input features is not known a 



priori, as in the infinite attribute model (Blum 



1992). This mechanism is implemented via the 



To winnow: to separate chaff from grain. 



sparse architecture of SNoW. That is, (1) input 
features are allocated in a data driven way - an 
input node for the feature % is allocated only if 
the feature i was active in any input sentence 
and (2) a link (i.e., a non-zero weight) exists 
between a target node t and a feature i if and 
only if i was active in an example labeled t. 

The Winnow update rule has, in addition to 
the threshold Ot at the target t, two update pa- 
rameters: a promotion parameter a > 1 and 
a demotion parameter < j3 < 1. These 
are being used to update the current represen- 
tation of the target t (the set of weights wj) 
only when a mistake in prediction is made. Let 
•At = {ii, • • ■ ,i m } be the set of active features 
that are linked to the target node t. If the algo- 
rithm predicts (that is, X^eA w i — anc ^ 
the received label is 1, the active weights in 



the current example are promoted in a multi- 



plicative fashion: Mi £ At,w\ 



a ■ w- 



If the 



algorithm predicts 1 (YlieAt w\ > 8 t ) and the 
received label is 0, the active weights in the cur- 
rent example are demoted Mi £ At-,w\ <— j3 ■ w\. 
All other weights are unchanged. 

The key feature of the Winnow update rule is 
that the number of examples required to learn 
a linear function grows linearly with the num- 
ber of relevant features and only logarithmically 
with the total number of features. This prop- 
erty seems crucial in domains in which the num- 
ber of potential features is vast, but a relatively 
small number of them is relevant. Winnow is 
known to learn efficiently any linear threshold 
function and to be robust in the presence of 
various kinds of noise and in cases where no 
linear-threshold function can make perfect clas- 
sifications, while still maintaining its abovemen- 
tioned dependence on the number of total and 



relevant attributes 


Littlestone, 1991 


and Warmuth, 1995 


)• 



Once target subnetworks have been learned 
and the network is being evaluated, a decision 
support mechanism is employed, which selects 
the dominant active target node in the SNoW 
unit via a winner-take-all mechanism to produce 
a final prediction. The decision support mech- 
anism may also be cached and processed along 
with the output of other SNoW units to produce 
a coherent output. 

3 Modeling Shallow Parsing 
3.1 Task Definition 

This section describes how we model the shallow 
parsing tasks studied here as learning problems. 
The goal is to detect NPs and SV phrases. Of 
the several slightly different definitions of a base 
NP in the literature we use for the purposes of 



this work the definition presented in (Ramshaw 
and Marcus, 1995) and used also by ( [Argamon 
et al., 1998 ) and others. That is, a base NP 
is a non-recursive NP that includes determin- 
ers but excludes post-modifying prepositional 
phrases or clauses. For example: 
...presented [last year] in 
[ Illinois ] in front of ... 

SV phrases, following the definition suggested 
in ( Argamon et al., 1998| ), are word phrases 
starting with the subject of the sentence and 
ending with the first verb, excluding modal 



verbsQ. For example, the SV phrases are brack- 
eted in the following: 

...presented [a theory that claims] 
that [ the algorithm runs ] and 
performs . . . 

Both tasks can be viewed as sequence recog- 
nition problems. This can be modeled as a col- 
lection of prediction problems that interact in 
a specific way. For example, one may predict 
the first and last word in a target sequence. 
Moreover, it seems plausible that information 
produced by one predictor (e.g., predicting the 
beginning of the sequence) may contribute to 
others (e.g., predicting the end of the sequence). 

Therefore, our computational paradigm sug- 
gests using SNoW predictors that learn sepa- 
rately to perform each of the basic predictions, 
and chaining the resulting predictors at evalu- 
ation time. Chaining here means that the pre- 
dictions produced by one of the predictors may 
be used as (a part of the) input to others^]. 

Two instantiations of this paradigm - each of 
which models the problems using a different set 
of predictors - are described below. 

3.2 Inside/Outside Predictors 

The predictors in this case are used to decide, 
for each word, whether it belongs to the inte- 
rior of a phrase or not; this information is then 
used to group the words into phrases. Since 
annotating words only with Inside/Outside in- 
formation is ambiguous in cases of two consec- 
utive phrases, an additional predictor is used. 
Specifically, each word in the sentence may be 
annotated using one of the following labels: O 
- the current word is outside the pattern. I - 
the current word is inside the pattern. B - the 
current word marks the beginning of a pattern 
that immediately follows another pattern^. 



2 Notice that according to this definition the identified 
verb may not correspond to the subject, but this phrase 
still contains meaningful information; in any case, the 
learning method presented is independent of the specific 
definition used. 

3 The input data used in all the experiments 
presented here consists of part-of-speech tagged 
data. In the demo of the system (available fro m 



trttp : //12r . cs .uiuc . edu/~cogcomp/eoh/index .html ), 

an additional layer of chaining is used. Raw sentences 
are supplied as inp ut and are processed usi ng a SNoW 
based POS tagger ([Roth and Zelenko, 1998|) first. 



4 There are other ways to define the B annotation, 
e.g., as always marking the beginning of a phrase. The 
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Figure 1: Architecture for Inside/Outside method. 



For example, the sentence I went to 
California last May would be marked for 
base NPs as: 

I went to California last May 
IOO I B I 

indicating that the NPs are I, California and 
last May. This approach has been studied 
in ( [Ramshaw and Marcus, 1995| ). 

3.2.1 Architecture 

SNoW is used in order to learn the OIB an- 
notations both for NPs and SV phrases. In 
each case, two predictors are learned, which dif- 
fer in the type of information they receive in 
their input. A first predictor takes as input a 



modeling used, however, turns out best experimentally. 



sentence along with the corresponding part-of- 
speech (POS) tags. The features extracted from 
this input represent the local context of each 
word in terms of POS tags (with the possible 
addition of lexical information) , as described in 
Sec 3.4. The SNoW predictor in this case con- 
sists of three targets - O, I and B. Figure [j] 
depicts the feature extraction module which ex- 
tracts the local features and generates an exam- 
ple for each word in the sentence. Each example 
is labeled with one of O, I or B. 

The second predictor takes as input a sen- 
tence along with the corresponding POS tags 
as well as the Inside/Outside information. 
The hope is that representing the local context 
of a word using the Inside/ Outside information 
for its neighboring words, in addition to the 
POS and lexical information, will enhance the 
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Figure 2: Architecture for Open/Close Method. 



performance of the predictor. While this in- 
formation is available during training, since the 
data is annotated with the OIB information, it 
is not available in the input sentence at evalua- 
tion time. Therefore, at evaluation time, given 
a sentence (represented as a sequence of POS 
tags), we first need to evaluate the first pre- 
dictor on it, generate an Inside/Outside repre- 
sentation of the sentence, and then use this to 
generate new features that feed into the second 
predictor. 

3.3 Open/Close Predictors 

The predictors in this case are used to decide, 
for each word, whether it is the first in a phrase, 
the last in a phrase, both of these, or none 
of these. In this way, the phrase boundaries 
are determined; this is annotated by placing an 
open bracket ( [ ) before the first word and a 
close bracket ( ] ) after the last word of each 
phrase. Our earlier example would be marked 
for base NPs as: [I] went to [California] 
[last May] . This approach has been studied 
in QChurch, 1988| ; |Argamon et al., 1998|) . 



3.3.1 Architecture 

The architecture used for the Open/Close pre- 
dictors is shown in Figure |2|. Two SNoW pre- 
dictors are used, one to predict if the word cur- 
rently in consideration is the first in the phrase 
(an open bracket), and the other to predict if it 
is the last (a close bracket). Each of the two pre- 
dictors is a SNoW network with two competing 
target nodes: one predicts if the current posi- 
tion is an open (close) bracket and the other 
predicts if it is not. In this case, the actual 
activation value (sum of weights of the active 
features for a given target) of the SNoW pre- 
dictors is used to compute a confidence in the 
prediction. Let ty be the activation value for 
the yes-bracket target and ijv for the no-bracket 
target. Normally, the network would predict 
the target corresponding to the higher activa- 
tion value. In this case, we prefer to cache the 
system preferences for each of the open (close) 
brackets predictors so that several bracket pair- 
ings can be considered when all the information 
is available. The confidence, 7, of a candidate is 
defined by 7 = t Y /{t Y + t N ). Normally, SNoW 
will predict that there is a bracket if 7 ^ 0.5, 
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Figure 3: Example of combinator assignment. Subscripts denote the confidence of the bracket 
candidates. Bracket candidates that would be chosen by the combinator are marked with a *. 



but this system employs an threshold r. We 
will consider any bracket that has 7 ) t as a 
candidate. The lower r is, the more candidates 
will be considered. 

The input to the open bracket predictor is 
a sentence and the POS tags associated with 
each word in the sentence. For each position 
in the sentence, the open bracket predictor de- 
cides if it is a candidate for an open bracket. For 
each open bracket candidate, features that cor- 
respond to this information are generated; the 
close bracket predictor can (potentially) receive 
this information in addition to the sentence and 
the POS information, and use it in its decision 
on whether a given position in the sentence is 
to be a candidate for a close bracket predictor 
(to be paired with the open bracket candidate). 

3.3.2 Combinator 

Finding the final phrases by pairing the open 
and close bracket candidates is crucial to the 
performance of the system; even given good 
prediction performance choosing an inadequate 
pairing would severely lower the overall perfor- 
mance. We use a graph based method that uses 
the confidence of the SNoW predictors to gener- 
ate the consistent pairings, at only a linear time 
complexity. 

We call p = (o, c) a pair, where o is an open 
bracket and c is any close bracket that was pre- 
dicted with respect to o. The position of a 
bracket at the ith word is defined to be i if 
it is an open bracket and i + 1 if it is a close 
bracket. Clearly, a pair (o, c) is possible only 
when pos{o) < pos(c). The confidence of a 
bracket t is the weight 7(i). The value of a pair 
p = (o, c) is defined to be v(p) = 7(0) * 7(c). 
The pair p\ occurs before the pair P2 if pos{c\) ^ 
pos{o2)- pi andp2 are compatible if either p\ oc- 
curs before P2 or P2 occurs before p\. A pairing 
is a set of pairs P = {pi,P2, ■ ■ - Pn} such that pi 
is compatible with pj for all i and j where i ^ j. 



The value of the pairing is the sum of all of the 
values of the pairs within the pairing. 

Our combinator finds the pairing with the 
maximum value. Note that while there may be 
exponentially many pairings, by modeling the 
problem of finding the maximum valued pairing 
shortest path problem on a directed acyclic 
graph, we provide a linear time solution. Fig- 
ure IH gives an example of pairing bracket can- 
didates of the sentence S = S1S2S3S4S5SQ, where 
the confidence of each candidate is written in 
the subscript. 

3.4 Features 

The features used in our system are relational 
features over the sentence and the POS informa- 
tion, which can be defined by a pair of numbers, 
k and w. Specifically, features are either word 
conjunctions or POS tags conjunctions. All con- 
junctions of size up to k and within a symmetric 
window that includes the w words before and 
after the designated word are generated. 

An example is shown in Figure |^ where 
(w, k) = (3, 4) for POS tags, and (w, k) = (1, 2) 
for words. In this example the word "how" is 
the designated word with POS tag "WRB". "()" 
marks the position of the current word (tag) 
if it is not part of the feature, and "(how)" 
or "(WRB)" marks the position of the current 
word (tag) if it is part of the current feature. 
The distance of a conjunction from the current 
word (tag) can be induced by the placement of 
the special character "_" in the feature. We do 
not consider mixed features between words and 



POS tags as in ( Ramshaw and Marcus, 1995 ), 
that is, a single feature consists of either words 
or tags. 

Additionally, in the Inside/Outside model, 
the second predictor incorporates as features 
the OIB status of the w words before and af- 
ter the designated word, and the conjunctions 
of size 2 of the words surrounding it. 
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Figure 4: An example of feature extraction. 



Data 


Sentences 


Words 


NP Patterns 


Training 
Test 


8936 
2012 


211727 
47377 


54758 
12335 



Table 1: Sizes of the training and test data sets for NP Patterns. 



Data 


Sentences 


Words 


SV Patterns 


Training 
Test 


16397 
1921 


394854 
46451 


25024 
3044 



Table 2: Sizes of the training and test data sets for SV Patterns. 



4 Methodology 
4.1 Data 

In order to be able to compare our results with 
the results obtained by other researchers, we 
worked with the same data sets already used 
by flRamshaw and Marcus, 199E ; Argamon et 
al., 1998j ) for NP and SV detection. These data 
sets were based on the Wall Street Journal cor- 



pus in the Penn Treebank (Marcus et al., 1993). 
For NP, the training and test corpus was pre- 
pared from sections 15 to 18 and section 20, 
respectively; the SV corpus was prepared from 
sections 1 to 9 for training and section for 
testing. Instead of using the NP bracketing in- 
formation present in the tagged Treebank data, 
Ramshaw and Marcus modified the data so as 



to include bracketing information related only 
to the non-recursive, base NPs present in each 
sentence while the subject verb phrases were 
taken as is. The data sets include POS tag 
information generated by Ramshaw and Mar- 
cus using Brill's transformational part-of-speech 
tagger ( Brill, 1995|) . 

The sizes of the training and test data are 
summarized in Table |l| and Table ||. 

4.2 Parameters 

The Open/Close system has two adjustable pa- 
rameters, Tr and ti, the threshold for the open 
and close bracket predictors, respectively. For 
all experiments, the system is first trained on 
90% of the training data and then tested on the 
remaining 10%. The rj and T[ that provide the 



best performance are used on the real test file. 
After the best parameters are found, the system 
is trained on the whole training data set. Re- 
sults are reported in terms of recall, precision, 
and Fa. Fa is always used as the single value to 



tend to break up long base NPs into smaller 
ones. 

The results also show that lexical information 
improves the performance by nearly 2%. This 



compare the performance. 



For all the experiments, we use 1 as the initial 
weight, 5 as the threshold, 1.5 as a, and 0.7 as 
j3 to train SNoW, and it is always trained for 2 
cycles. 

4.3 Evaluation Technique 

To evaluate the results, we use the following 
metrics: 



is similar to results in the literature ( Ramshaw 
and Marcus, 1995| ) . What we found surprising is 



Recall 



Number of correct proposed patterns 
Number of correct patterns 



that the second predictor, that uses additional 
information about the OIB status of the local 
context, did not do much better than the first 
predictor, which relies only on POS and lexical 
information. A control experiment has verified 
that this is not due to the noisy features that the 
first predictor supplies to the second predictor. 

Finally, the Inside/Outside method was also 
tested on predicting SV phrases, yielding poor 
results that are not shown here. An attempt at 
explaining this phenomena by breaking down 
performance according to the length of the 



Precision 



Accuracy 



We use (3 = 1. Note that, for the Open/Close 
system, we must measure the accuracy for the 
open predictor and the close predictor sepa- 
rately since each word can be labeled as "Open" 
or "Not Open" and, at the same time, "Close" 
or "Not Close". 

5 Experimental Results 
5.1 Inside/Outside 

The results of each of the predictors used 
in the Inside/Outside method are presented 
in Table [3|. The results are comparable to 
other results reported using the Inside/Outside 
method ( Ramshaw and Marcus, 1995[) (see Ta- 
ble [/]. We have observed that most of the mis- 
taken predictions of base NPs involve predic- 
tions with respect to conjunctions, gerunds, ad- 
verbial NPs and some punctuation marks. As 
reported in ( Argamon et al., 1998|) , most base 



phrases is discussed in Sec. 5.3. 



Number of correct proposed patterns 
Number of proposed patterns 



(/3 2 + 1) • Recall • Precision 
(5 2 ■ Precision + Recall 



Number of words labeled correctly 
Total number of words 



NPs present in the data are less or equal than 
4 words long. This implies that our predictors 



5.2 Open/Close 

The results of the Open/Close method for NP 
and SV phrases are presented in Table |j. In ad- 
dition to the good overall performance, the re- 
sults show significant improvement by incorpo- 
rating the lexical information into the features. 
In addition to the recall/precision results we 
have also presented the accuracy of each of the 
Open and Close predictors. These are impor- 
tant since they determine the overall accuracy 
in phrase detection. It is evident that the pre- 
dictors perform very well, and that the overall 
performance degrades due to inconsistent pair- 
ings. 

An important question in the learning ap- 
proach presented here is investigating the gain 
achieved due to chaining. That is, whether the 
features extracted from open brackets can im- 
prove the performance of the the close bracket 
predictor. To this effect, we measured the ac- 
curacy of the close bracket predictor itself, on a 
word basis, by supplying it features generated 
from correct open brackets. We compared this 
with the same experiment, only this time with- 
out incorporating the features from open brack- 
ets to the close bracket predictor. The results, 
shown in Table |H| indicate a significant contribu- 
tion due to chaining the features. Notice that 
the overall accuracy for the close bracket pre- 
dictor is very high. This is due to the fact that, 
as shown in Table [2|, there are many more neg- 
ative examples than positive examples. Thus, a 



Method 


Recall 


Precision 


Fp=\ 


Accuracy 


First Predictor 


90.5 


89.8 


90.1 


96.9 


Second Predictor 


90.5 


90.4 


90.4 


97.0 


First Predictor + lexical 


92.5 


92.2 


92.4 


97.6 


Second Predictor + lexical 


92.5 


92.1 


92.3 


97.6 



Table 3: Results for NP detection using Inside/Outside method. 



Method 


Recall 


Precision 




Accuracy 










Open 


Close 


SV w/o lexical 


88.3 


87.9 


88.1 


98.6 


99.4 


SV with lexical 


91.9 


92.2 


92.0 


99.2 


99.4 


NP w/o lexical 


90.9 


90.3 


90.6 


97.4 


97.8 


NP with lexical 


93.1 


92.4 


92.8 


98.1 


98.2 



Table 4: Results for SV Phrase and NP detection using Open/Close method. 



T 


Without Open info 


With Open bracket info 


Overall 


Positive Only 


Overall 


Positive Only 


0.5 


99.3 


92.7 


99.4 


95.0 



Table 5: Accuracy of close bracket predictor when using features created on local information alone 
versus using additional features created from the open bracket candidate. Overall performance and 
performance on positive examples only is shown. 



predictor that always predicts "no" would have 
an accuracy of 93.4%. Therefore, we considered 
also the accuracy over positive examples, which 
indicates the significant role of the chaining. 

5.3 Discussion 

Both methods we study here - Inside/Outside 
and Open/Close - have been evaluated before 
(using different learning methods) on similar 
tasks. However, in this work we have allowed for 
a fair comparison between two different models 
by using the same basic learning method and 
the same features. 

Our main conclusion is with respect to the 
robustness of the methods to sequences of dif- 
ferent lengths. While both methods give good 
results for the base NP problem, they differ sig- 
nificantly on the SV tasks. Furthermore, our 
investigation revealed that the Inside/Outside 
method is very sensitive to the length of the 
phrases. Table || shows a breakdown of the per- 
formance of the two methods on SV phrases of 



different lengths. Perhaps this was not observed 
earlier since (Ramshaw and Marcus, 1995) stud- 
ied only base NPs, most of which are short. 
The conclusion is therefore that the Open / Close 
method is more robust, especially when the tar- 
get sequences are longer than a few tokens. 

Finally, Tables [?] and ^ present a comparison 
of our methods to some of the best NP and SV 
results published on these tasks. 

6 Conclusion 

We have presented a SNoW based learning ap- 
proach to shallow parsing tasks. The learning 
approach suggests to identify a syntactic pat- 
terns is performed by writing a simple program 
in which several instantiations of SNoW learn- 
ing units are chained and combined to produce 
a coherent inference. Two instantiations of this 
approach have been described and shown to per- 
form very well on NP and SV phrase detection. 
In addition to exhibiting good results on shallow 



Length 


Patterns 


Inside/Outside 


Open/Close 


Recall 


Precision 




Recall 


Precision 


F/3=i 


^4 


2212 


90.5 


61.5 


73.2 


94.1 


93.5 


93.8 


4 < I < 8 


509 


61.4 


44.1 


51.3 


72.3 


79.7 


75.8 


> 8 


323 


30.3 


15.0 


20.0 


74.0 


64.4 


68.9 



Table 6: Comparison of Inside/Outside and Open/Close on SV patterns of varying lengths. 



Method 


Recall 


Precision 




Accuracy 


Inside/Outside 


90.5 


90.4 


90.4 




97.0 


Inside/Outside + lexical 


92.5 


92.2 


92.4 




97.6 


Open/Close 


90.9 


90.3 


90.6 


O: 


97.4, C: 97.8 


Open/Close + lexical 


93.1 


92.4 


92.8 


O: 


98.1, C: 98.2 


Ramshaw & Marcus 


90.7 


90.5 


90.6 




97.0 


Ramshaw & Marcus + lexical 


92.3 


91.8 


92.0 




97.4 


Argamon et al. 


91.6 


91.6 


91.6 




N/A 



Table 7: Comparison of Results for NP. In the accuracy column, O indicates the accuracy of the 
Open predictor and C indicates the accuracy of the Close predictor. 



Method 


Recall 


Precision 


F/3=i 


Accuracy 


Open/Close 
Open/Close + lexical 


88.3 
91.9 


87.9 
92.2 


88.1 
92.0 


O: 98.6, C: 99.4 
O: 99.2, C: 99.4 


Argamon et al. 


84.5 


88.6 


86.5 


N/A 



Table 8: Comparison of Results for SV. In the accuracy column, O indicates the accuracy of the 
Open predictor and C indicates the accuracy of the Close predictor. 



parsing tasks, we have made some observations 
on the sensitivity of modeling the task. We be- 
lieve that the paradigm described here, as well 
as the basic learning system, can be used in this 
way in many problems that are of interest to the 
NLP community. 
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